Moving from serial CPU programming to GPU programming requires a paradigm shift: from iterating element by element to block-based execution. Instead of treating data as a stream of scalars, we view it as a collection of blocks that are scheduled to make full use of the hardware's bandwidth.
1. Memory-Bound vs. Compute-Bound
A kernel's performance bottleneck depends on the ratio of math operations to memory accesses. Vector addition is typically memory-bound because it performs only one addition for every three memory operations (two loads, one store). The hardware spends far more time waiting on data from DRAM than actually computing.
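The ratio above can be made concrete as "arithmetic intensity" (FLOPs per byte of memory traffic). A minimal sketch, assuming float32 elements and counting only the two loads and one store per element:

```python
# Sketch: estimating the arithmetic intensity of float32 vector addition.
# The element count and byte accounting are illustrative assumptions.

def arithmetic_intensity(flops: int, bytes_moved: int) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

# out = x + y on N float32 elements:
#   1 add per element; 3 memory ops per element (load x, load y, store out),
#   each moving 4 bytes.
N = 1_000_000
flops = N                 # one addition per element
bytes_moved = 3 * 4 * N   # two loads + one store, 4 bytes each

ai = arithmetic_intensity(flops, bytes_moved)
print(f"arithmetic intensity = {ai:.4f} FLOP/byte")  # ~0.083
```

At roughly 0.08 FLOP/byte, vector addition sits far below the FLOP-to-bandwidth ratio of any modern GPU, which is why the memory bus, not the ALUs, is the limiter.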
2. The Role of BLOCK_SIZE
BLOCK_SIZE defines the granularity of parallelism. If it is too small, we fail to fill the GPU's wide execution lanes. A well-chosen size ensures there is enough work "in flight" to saturate the memory bus.
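One concrete consequence of BLOCK_SIZE is the size of the launch grid: the number of blocks is the ceiling of the element count divided by BLOCK_SIZE. A minimal sketch (function name and numbers are illustrative):

```python
# Sketch: how BLOCK_SIZE determines the launch grid via ceiling division.

def grid_size(n_elements: int, block_size: int) -> int:
    """Number of blocks needed so that every element is covered."""
    return (n_elements + block_size - 1) // block_size

# A tiny BLOCK_SIZE launches many small blocks that cannot fill wide
# SIMD lanes; a larger size keeps each block's lanes busy.
n = 10_000
print(grid_size(n, 32))    # 313 blocks of 32 elements
print(grid_size(n, 1024))  # 10 blocks of 1024 elements
```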
3. Hiding Latency via Occupancy
Occupancy refers to the number of active blocks resident on the GPU. It is not a goal in itself, but it allows the scheduler to switch to another block and keep computing while one block waits on a high-latency fetch from device memory.
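A toy back-of-envelope model makes this concrete: if a memory access stalls a block for some number of cycles, the scheduler needs enough other resident blocks whose compute work covers that wait. All numbers below are illustrative assumptions, not figures for any real GPU:

```python
# Sketch: a toy latency-hiding model (all cycle counts are illustrative).
# While one block waits `mem_latency` cycles for data, the scheduler issues
# compute from other resident blocks; the stall is hidden once their
# combined compute covers the wait.
import math

def blocks_to_hide_latency(mem_latency: int, compute_cycles: int) -> int:
    """Resident blocks needed so a memory stall is fully overlapped
    (the +1 accounts for the block that is itself waiting)."""
    return math.ceil(mem_latency / compute_cycles) + 1

# Assume a 400-cycle memory access and 20 cycles of math per block
# between accesses:
print(blocks_to_hide_latency(400, 20))  # 21 resident blocks
```

This is why low occupancy hurts memory-bound kernels: with too few blocks in flight, the scheduler simply runs out of work to issue during stalls.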
4. Hardware Utilization
To maximize performance, we must align our BLOCK_SIZE with the GPU architecture's memory-coalescing rules, ensuring that consecutive threads access consecutive memory addresses.
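The coalescing rule can be illustrated by comparing the addresses a block's threads touch under a contiguous versus a strided indexing scheme. A minimal sketch (the block id, sizes, and stride are hypothetical):

```python
# Sketch: coalesced vs. strided access patterns (illustrative values).
# Consecutive threads should touch consecutive addresses so the hardware
# can merge their accesses into a few wide memory transactions.

BLOCK_SIZE = 8
pid = 2  # hypothetical block id

# Coalesced: thread i reads element block_start + i (contiguous addresses).
block_start = pid * BLOCK_SIZE
coalesced = [block_start + i for i in range(BLOCK_SIZE)]
print(coalesced)  # [16, 17, 18, 19, 20, 21, 22, 23]

# Strided: thread i reads element pid + i * num_blocks (addresses spread
# out), forcing many separate, narrow transactions.
num_blocks = 4
strided = [pid + i * num_blocks for i in range(BLOCK_SIZE)]
print(strided)  # [2, 6, 10, 14, 18, 22, 26, 30]
```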
QUESTION 1
For a kernel that adds two vectors ($out = x + y$), what is the most likely bottleneck on modern GPUs?
Arithmetic Throughput
Memory Bandwidth
Register Pressure
Shared Memory Latency
✅ Correct!
Vector addition involves very little math compared to the amount of data moved (3 memory ops per 1 add), making it memory-bound.
❌ Incorrect
Arithmetic throughput is rarely the bottleneck for simple element-wise operations like addition.
QUESTION 2
What is the primary purpose of 'Occupancy' in the GPU execution model?
To ensure every thread runs as fast as possible.
To hide memory latency by keeping work in flight.
To increase the clock speed of the compute units.
To reduce the power consumption of the HBM.
✅ Correct!
High occupancy allows the GPU to switch to active threads while others wait for data from global memory.
❌ Incorrect
Occupancy doesn't change thread speed or clock frequency; it focuses on scheduler efficiency.
QUESTION 3
Which of the following describes 'Memory-Bound' behavior?
The GPU is waiting for the memory bus to deliver data.
The GPU has exhausted its available VRAM.
The kernel is performing too many complex floating-point operations.
The CPU cannot launch kernels fast enough.
✅ Correct!
Memory-bound kernels are limited by the speed of data transfer from DRAM/HBM to the registers.
❌ Incorrect
Exhausting VRAM is an Out-of-Memory error, not a 'memory-bound' performance bottleneck.
QUESTION 4
What happens if the BLOCK_SIZE is set too small?
The kernel will fail with a memory error.
The GPU fails to utilize its wide SIMD execution lanes.
The memory bandwidth increases significantly.
Register pressure becomes too high.
✅ Correct!
Small block sizes result in underutilization because the hardware's execution units expect many threads to work in parallel.
❌ Incorrect
Small block sizes actually reduce register pressure but hurt throughput.
QUESTION 5
In the logistics warehouse analogy, what represents the 'Blocks'?
The individual items.
The workers.
The organized pallets.
The delivery trucks.
✅ Correct!
Organizing items into pallets (Blocks) ensures efficient transport and processing by workers (Compute Units).
❌ Incorrect
The trucks represent the memory bus; the workers represent the compute units.
Case Study: Bottleneck Analysis
Identifying Kernel Constraints
You are profiling four kernels: a Vector Addition kernel, a Deep Matrix Multiplication (GEMM) kernel, a tiny 4-element Vector Addition kernel, and a kernel that performs ReLU on a matrix. You need to categorize their bottlenecks based on hardware utilization theory.
Q
1. For each kernel (Vector Add, Matrix Multiply, 4-element Vector Add), decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead.
Solution:
1. **Vector Addition**: Memory Bandwidth (low math-to-memory ratio).
2. **Deep Matrix Multiply**: Arithmetic Throughput (high $O(N^3)$ compute vs $O(N^2)$ memory).
3. **4-element Vector Add**: Launch Overhead (the time to start the GPU kernel outweighs the tiny workload).
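The vector-add vs. GEMM split can be checked with rough FLOP-to-byte ratios. The counts below are the standard back-of-envelope estimates for float32 data, not profiler measurements:

```python
# Sketch: rough arithmetic-intensity estimates behind the classification
# above (float32; counts are back-of-envelope, not measured).

def intensity(flops: float, bytes_moved: float) -> float:
    return flops / bytes_moved

N = 4096

# Vector add: 1 FLOP and 12 bytes (two loads + one store) per element.
vec_add = intensity(N, 12 * N)

# GEMM (N x N): ~2*N^3 FLOPs; at minimum three N^2 float32 matrices moved.
gemm = intensity(2 * N**3, 3 * 4 * N**2)

print(f"vector add: {vec_add:.3f} FLOP/byte")  # ~0.083: well below machine balance
print(f"GEMM:       {gemm:.1f} FLOP/byte")     # hundreds: enough to be compute-bound
```

Because GEMM's intensity grows linearly with $N$ while vector add's stays constant, large matrix multiplies cross into compute-bound territory and tiny element-wise kernels never do.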
Q
2. Determine the bottleneck for a ReLU operation on a large matrix.
Solution:
The bottleneck for **ReLU** on a matrix is **Memory Bandwidth**. Since the operation is a simple comparison ($max(0, x)$), it is extremely computationally cheap, meaning performance is dictated by how fast the GPU can read the matrix from and write it back to global memory.